The purpose of this simulation is to test the efficacy of distribution regression for predicting individual-level political support from political polling data.
In a typical distribution regression setting, we observe bags of iid samples \(\{\mathbf{x}_i^j\}_{i = 1}^{N_j}\) from distributions, along with an aggregated outcome \(\mathbf{y}^j\) for each bag. The goal is to regress \(\mathbf{y}^j\) on the distribution of \(X^j\), from which we only observe iid samples. In our setting, we observe the outcome at the individual level, but cannot link it to rich covariate data. We therefore aggregate the outcome ourselves as a way of linking outcomes to distributions.
We use k-means++ to define \(b\) bags of (both matched and unmatched) survey respondents. This procedure requires the specification of the following hyperparameters:
There is not much guidance in the literature for how to set these parameters. In distribution regression problems, the number of bags is generally determined by the data. Landmark points are commonly used for dimensionality reduction and computational efficiency, so "more is better" (up to computational limits), but how many is "enough" is not well defined.
There is more intuition for kernel choice, where the options are more clearly defined for given types of covariate data. Here we consider three kernels: 1) a simple linear kernel, 2) an RBF kernel, and 3) a custom kernel that combines a linear kernel for the categorical covariates with an RBF kernel for age. The RBF kernels require specification of a bandwidth hyperparameter, \(\sigma\), which we set using the median heuristic (Garreau, Jitkrittum, and Kanagawa 2018). For an RBF kernel \(k(x,x') = \exp\left(-\frac{\|x - x'\|^2}{\sigma}\right)\), we would like \(\sigma\) to have the same magnitude as \(\|x - x'\|^2\), and achieve this by setting \(\sigma\) to the median of the observed \(\|x - x'\|^2\).
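The median heuristic is simple to compute directly from the pairwise squared distances. A minimal numpy sketch (the data here is synthetic and the function name `median_heuristic` is our own label, not from the simulation code):

```python
import numpy as np

def median_heuristic(X):
    # sigma = median of pairwise squared Euclidean distances, so that
    # ||x - x'||^2 / sigma is of order 1 for a typical pair of points.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    # Exclude the zero diagonal (distance of each point to itself)
    return np.median(d2[np.triu_indices_from(d2, k=1)])

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
sigma = median_heuristic(X)

# RBF kernel matrix with the chosen bandwidth
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / sigma)
```

For large data sets, the median is often computed over a random subsample of pairs rather than all \(n(n-1)/2\) of them.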
This simulation will evaluate performance of distribution regression models across different combinations of these hyperparameters.
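Putting these pieces together, the landmark-based distribution regression step can be sketched as follows: average each bag's landmark kernel features to form a mean embedding, then ridge-regress the bag outcomes on the embeddings. This is an illustrative numpy sketch on synthetic bags; the landmark count, bandwidth, and ridge penalty are arbitrary choices for the example, not the values used in the simulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(X, Z, sigma):
    # Pairwise RBF kernel k(x, z) = exp(-||x - z||^2 / sigma)
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma)

# Toy data: 30 bags drawn from shifted Gaussians; each bag's outcome
# depends on the bag's (unobserved) location.
bags = [rng.normal(loc=m, size=(50, 2)) for m in rng.uniform(-2, 2, size=30)]
y = np.array([b.mean() for b in bags])

landmarks = rng.uniform(-2, 2, size=(20, 2))          # landmark points
mu = np.stack([rbf(b, landmarks, sigma=2.0).mean(0)   # bag mean embeddings
               for b in bags])

# Ridge regression of bag outcomes on mean embeddings
lam = 1e-3
beta = np.linalg.solve(mu.T @ mu + lam * np.eye(mu.shape[1]), mu.T @ y)

# Individual-level prediction: apply the same feature map to one observation
x_new = np.zeros((1, 2))
y_hat = rbf(x_new, landmarks, sigma=2.0) @ beta
```

The same fitted coefficients apply to a single observation's landmark features, which is what allows bag-level training to produce individual-level predictions.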
In addition to the basic distribution regression outlined above, we implement a weighted version, intended to correct for the imbalance in observed covariates between the matched responses, unmatched responses, and voterfile observations used in training. In practice, survey respondents are often more partisan, older, and more likely to be retired or work in an office job than those who choose not to respond to a survey. Additionally, respondents who match back to a voterfile are typically more likely to be high-income, white, and homeowners than those who don't. As a result, the covariate distributions of respondents who match back to the voterfile differ from those of the population we are trying to predict.
To correct for this, we implement two layers of Kernel Mean Matching (KMM) (Gretton et al. 2013). We first weight matched respondents so that their weighted covariate distributions match those observed in the voterfile. Then we use KMM again to weight unmatched respondents so that their covariate distributions match those of the matched respondents. The weighting must be done in two stages because different covariates are observed for different subsets of the data: typically only a few covariates are observed for survey responses, while a much richer set is observed in the voterfile. The two weighting stages therefore match the kernel means of different sets of observed covariates.
KMM weights can be calculated efficiently by formulating the problem as a quadratic program. We constrain the weights to lie between 0 and \(B\), and to sum to the number of observations being weighted. \(B\) can be interpreted as the maximum number of units represented by a single weighted observation. If \(B\) is too small, the weighted covariate distributions may remain far from the target distributions; for example, if \(B = 1\) then all weights equal 1 and the covariate distributions are left unadjusted. If \(B\) is too large, a single observation may receive too much weight, and therefore too much influence over weighted estimates. Here we set \(B = 5\) for the first round of weighting and \(B = 7\) for the second, based on domain knowledge of what is typically considered an acceptable survey weight.
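The quadratic program minimizes the distance between the weighted source kernel mean and the target kernel mean, subject to the box and sum constraints above. A sketch in Python using a generic constrained optimizer (scipy's SLSQP) as a stand-in for a dedicated QP solver; the function `kmm_weights`, the synthetic data, and the \(\sigma\) and \(B\) values are all illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def rbf(X, Z, sigma):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma)

def kmm_weights(X_src, X_tgt, sigma=1.0, B=5.0):
    # Minimize || (1/n) sum_i w_i phi(x_i) - (1/m) sum_j phi(z_j) ||^2
    # subject to 0 <= w_i <= B and sum_i w_i = n.
    n, m = len(X_src), len(X_tgt)
    K = rbf(X_src, X_src, sigma)                 # n x n source kernel
    kappa = rbf(X_src, X_tgt, sigma).sum(1) / m  # cross-kernel means

    def objective(w):
        return w @ K @ w / n**2 - 2.0 * (w @ kappa) / n

    res = minimize(objective, x0=np.ones(n),
                   bounds=[(0.0, B)] * n,
                   constraints=[{'type': 'eq', 'fun': lambda w: w.sum() - n}],
                   method='SLSQP')
    return res.x

# Toy covariate shift: the source over-represents large x, the target is centred.
rng = np.random.default_rng(2)
X_src = rng.normal(loc=1.0, size=(100, 1))
X_tgt = rng.normal(loc=0.0, size=(100, 1))
w = kmm_weights(X_src, X_tgt, sigma=2.0, B=5.0)
# The weighted source mean moves toward the target mean.
```

In our two-stage setup, the first call would weight matched respondents toward the voterfile, and the second would weight unmatched respondents toward the (matched) respondents, each over its own covariate set.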
Another version of the distribution regression model assigns each matched response to its own bag. In effect, the mean embeddings of the matched observations are just the observations themselves, so the regularized regression on the kernel mean embeddings reduces to regularized regression on the matched observations plus additional "observations" given by the kernel mean embeddings of the unmatched data.
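The resulting design matrix can be shown directly: matched respondents each contribute a row of their own landmark features, while unmatched respondents contribute bag-averaged rows. A numpy sketch with synthetic data (sizes, seeds, and bandwidth are arbitrary for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

def rbf(X, Z, sigma=2.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma)

landmarks = rng.normal(size=(10, 3))

# Matched respondents: one "bag" per person, so the mean embedding
# reduces to that person's own landmark features.
X_matched = rng.normal(size=(40, 3))
Phi_matched = rbf(X_matched, landmarks)          # 40 rows, one per person

# Unmatched respondents: grouped into 5 bags; each row is a bag average.
X_unmatched = rng.normal(size=(60, 3))
bag_id = rng.integers(0, 5, size=60)
Phi_unmatched = np.stack([rbf(X_unmatched[bag_id == b], landmarks).mean(0)
                          for b in range(5)])

# Regularized regression on this stacked matrix is individual-level
# regression on the matched rows plus bag-level rows for the unmatched.
Phi = np.vstack([Phi_matched, Phi_unmatched])    # shape (45, 10)
```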
The questions we seek to answer in this simulation are:
We will measure performance in two ways:
On each iteration, we will perform the following steps:
There are 12 models in total fit on each iteration. Three are baseline models for comparison:
The rest are distribution regression models that vary in 1) kernel specification, 2) whether they are weighted or unweighted, and 3) whether the matched data is assigned to separate bags.
The models are:
| model_name | kernel | weighted | separate_bags |
|---|---|---|---|
| logit | | | |
| logit_alldata | | | |
| dr_linear | linear | | |
| wdr_linear | linear | X | |
| dr | rbf | | |
| wdr | rbf | X | |
| dr_cust | custom | | |
| dr_sepbags | rbf | | X |
| wdr_sepbags | rbf | X | X |
| dr_sepbags_lin | linear | | X |
| dr_sepbags_cust | custom | | X |
| grpmean | | | |
Pew Research conducts regular public opinion polls measuring political attitudes in the US, like this political survey from September 2018. We use 4 surveys fielded over the 6 months leading up to the 2018 US midterm elections, all of which ask respondents which party they plan to support in the upcoming election, along with a selection of demographic variables (e.g. age, income bracket, education, race).
The data contains `pew_data[, .N]` total responses collected from live interviews on landlines and cell phones.
# Load the per-run prediction files and compute holdout-group error for each model
pred_files = list.files('~/github/bdr/pew-experiment/results/sim_randparams', pattern = '^party', full.names = TRUE)
holdout_error = rbindlist(lapply(pred_files, function(f){
  temp = fread(f)
  # Holdout indices are the same for every model within a run
  holdout_ind = which(temp[model == 'logit',]$holdout == 1)
  # Attach the actual outcome, repeated once per model
  temp$act_class = rep(pew_data$support, length(unique(temp$model)))
  # Predicted class = party with the largest predicted probability
  temp[, pred_class := c('1-Dem', '2-Rep', '3-Oth')[apply(temp[, .(y_hat_dem, y_hat_rep, y_hat_oth)], 1, which.max)]]
  temp[, correct_class := as.numeric(act_class == pred_class)]
  # Average predictions and classification rate over the holdout group,
  # with the actual holdout-group outcome rates alongside
  holdout_error = cbind(temp[holdout == 1, .(y_hat_dem = mean(y_hat_dem)
                                             , y_hat_rep = mean(y_hat_rep)
                                             , y_hat_oth = mean(y_hat_oth)
                                             , class_rate = mean(correct_class)
                                             ), by = .(model, results_id, match_rate, n_bags, n_landmarks, refit_bags, party)]
                        , pew_data[holdout_ind, .(y_dem = mean(y_dem)
                                                  , y_rep = mean(y_rep)
                                                  , y_oth = mean(y_oth)
                                                  )]
  )
  # Two-way Dem share excludes the 'Other' category
  holdout_error[, y_hat_dem_2way := y_hat_dem/(1 - y_hat_oth)]
  holdout_error[, error_dem := y_hat_dem - y_dem]
  holdout_error[, error_rep := y_hat_rep - y_rep]
  holdout_error[, error_oth := y_hat_oth - y_oth]
  holdout_error[, error_dem_2way := y_hat_dem_2way - (y_dem/(1-y_oth))]
  holdout_error
}))
The first plot below shows the distribution of test-group MSEs from the 596 simulation runs. The variation in MSE for the benchmarks (logit_alldata, logit, and grpmean) is mainly due to different test/train splits and to the variation in match rate and party; the other settings remain constant across runs. The DR models show additional variation in test-group MSEs due to the randomized hyperparameter values.
The second plot shows the distribution of the bias in the estimated % Dem across simulation settings.
ggplot(mses, aes(x = model, y = mse )) + geom_boxplot() +
facet_grid(~party) + coord_flip() +
ggtitle("Distribution of MSE by model and party")
ggplot(holdout_error, aes(x = model, y = error_dem)) +
geom_hline(yintercept = 0, color = 'red') + facet_grid(~party) +
geom_boxplot() + coord_flip() +
ggtitle("Bias in Estimated % Dem")
Main takeaways:
model_subset = c('logit_alldata', 'logit', 'grpmean', 'dr','wdr','dr_sepbags')
ggplot(mses[model %in% model_subset], aes(x = match_rate,y = mse, color = model)) +
geom_point(alpha = 0.2) +
geom_smooth() +
facet_grid(~party) +
ggtitle("Holdout MSE by match rate")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(holdout_error[model %in% model_subset], aes(x = match_rate,y = error_dem, color = model)) +
geom_point(alpha = 0.2) +
geom_smooth(se = F) +
facet_grid(~party) +
ggtitle("Holdout Dem bias by match rate")
Main takeaways:
model_subset = c('logit_alldata','logit', 'grpmean', 'dr','wdr','dr_sepbags')
ggplot(mses[model %in% model_subset], aes(x = n_bags,y = mse, color = model)) +
facet_grid(~party, scales = 'free_y') +
geom_point(alpha = 0.2) +
geom_smooth() +
ggtitle("Holdout MSE by number of bags")
ggplot(holdout_error[model %in% model_subset], aes(x = n_bags,y = error_dem, color = model)) +
geom_point(alpha = 0.2) +
geom_smooth(se = F) +
facet_wrap(~party, scales = 'free_y') +
ggtitle("Holdout Dem bias by number of bags")
Fitting bags using just the unmatched data seems to improve model performance slightly for some numbers of bags (~80).
ggplot(mses[model %in% c('dr_sepbags', 'wdr_sepbags')], aes(x = n_bags,y = mse, color = refit_bags)) +
geom_point(alpha = 0.2) +
geom_smooth() +
facet_grid(party~model, scales = 'free') +
ggtitle("Holdout MSE by number of bags")
ggplot(holdout_error[model %in% c('dr_sepbags', 'wdr_sepbags')], aes(x = n_bags,y = error_dem, color = refit_bags)) +
geom_point(alpha = 0.2) +
geom_smooth() +
facet_grid(party~model, scales = 'free') +
ggtitle("Holdout Dem bias by number of bags")
model_subset = c('logit', 'grpmean', 'dr','wdr','dr_sepbags')
ggplot(mses[model %in% model_subset], aes(x = n_landmarks,y = mse_relall, color = model)) +
geom_point(alpha = 0.2) +
geom_smooth() +
facet_wrap(~party, scales = 'free') +
  ggtitle("Holdout MSE by number of landmarks")
ggplot(holdout_error[model %in% model_subset], aes(x = n_landmarks,y = error_dem, color = model)) +
geom_point(alpha = 0.2) +
geom_smooth(se = F) +
facet_wrap(~party, scales = 'free') +
ggtitle("Holdout Dem bias by number of landmarks")
ggplot(mses[model %in% c('dr','wdr','dr_sepbags') & party == 'insurvey'], aes(x = n_landmarks,y = n_bags)) +
geom_point(aes(color = mse_relall), alpha = 0.6) +
#geom_smooth(se = F) +
facet_wrap(~model, scales = 'free') +
  ggtitle("Holdout MSE by number of landmarks and bags")
At this stage, it appears that DR is not an improvement over simple logistic regression on the subset of matched respondents for this data set.
Party is incredibly predictive of the outcome, so how we treat it has an enormous impact on model performance and on the conclusions from this simulation. This level of predictiveness is not realistic in most settings, however, because here party ID is collected at the same time as support, and therefore contains almost the same information as the outcome. We don't generally have voterfile data with that level of predictiveness.
Garreau, Damien, Wittawat Jitkrittum, and Motonobu Kanagawa. 2018. "Asymptotic Normality of the Median Heuristic." http://arxiv.org/abs/1707.07269.
Gretton, Arthur, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Schölkopf. 2013. “Covariate Shift by Kernel Mean Matching.” Dataset Shift in Machine Learning, 131–60. https://doi.org/10.7551/mitpress/9780262170055.003.0008.